Phishing Link Detection Machine Learning¶

Jonathan Christyadi (502705) - AI Core 02

This notebook aims at predicting the likelihood of a link being a phishing link or a legitimate link with a focus on exploring and testing hypotheses that necessitate further research.

Dataset: https://data.mendeley.com/datasets/c2gw7fy2j4/3

In [ ]:
import sklearn
import pandas as pd
import seaborn
import numpy as np
print("scikit-learn version:", sklearn.__version__)     # 1.1.3
print("pandas version:", pd.__version__)            # 1.5.1
print("seaborn version:", seaborn.__version__)          # 0.12.1
scikit-learn version: 1.4.1.post1
pandas version: 2.2.1
seaborn version: 0.13.2

📦 Data provisioning¶

After loading the dataset, I found out some inconsistencies among the data. First the label of the link (phishing or legitimate) can be changed into binary format. Also, for domain_with_copyright column, some are in binary and some are written in alphabets, for example: zero, One, etc.

In [ ]:
df = pd.read_csv("Data\dataset_link_phishing.csv", sep=',', index_col=False, dtype='unicode')
df.head()
Out[ ]:
id url url_length hostname_length ip total_of. total_of- total_of@ total_of? total_of& ... domain_in_title domain_with_copyright whois_registered_domain domain_registration_length domain_age web_traffic dns_record google_index page_rank status
0 0 http://www.progarchives.com/album.asp?id=61737 46 20 0 3 0 0 1 0 ... 1 one 0 627 6678 78526 0 0 5 phishing
1 1 http://signin.eday.co.uk.ws.edayisapi.dllsign.... 128 120 0 10 0 0 0 0 ... 1 zero 0 300 65 0 0 1 0 phishing
2 2 http://www.avevaconstruction.com/blesstool/ima... 52 25 0 3 0 0 0 0 ... 1 zero 0 119 1707 0 0 1 0 phishing
3 3 http://www.jp519.com/ 21 13 0 2 0 0 0 0 ... 1 one 0 130 1331 0 0 0 0 legitimate
4 4 https://www.velocidrone.com/ 28 19 0 2 0 0 0 0 ... 0 zero 0 164 1662 312044 0 0 4 legitimate

5 rows × 87 columns

In [ ]:
# Taking a look at the data types of the columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19431 entries, 0 to 19430
Data columns (total 87 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   id                          19431 non-null  object
 1   url                         19431 non-null  object
 2   url_length                  19431 non-null  object
 3   hostname_length             19431 non-null  object
 4   ip                          19431 non-null  object
 5   total_of.                   19431 non-null  object
 6   total_of-                   19431 non-null  object
 7   total_of@                   19431 non-null  object
 8   total_of?                   19431 non-null  object
 9   total_of&                   19431 non-null  object
 10  total_of=                   19431 non-null  object
 11  total_of_                   19431 non-null  object
 12  total_of~                   19431 non-null  object
 13  total_of%                   19431 non-null  object
 14  total_of/                   19431 non-null  object
 15  total_of*                   19431 non-null  object
 16  total_of:                   19431 non-null  object
 17  total_of,                   19431 non-null  object
 18  total_of;                   19431 non-null  object
 19  total_of$                   19431 non-null  object
 20  total_of_www                19431 non-null  object
 21  total_of_com                19431 non-null  object
 22  total_of_http_in_path       19431 non-null  object
 23  https_token                 19431 non-null  object
 24  ratio_digits_url            19431 non-null  object
 25  ratio_digits_host           19431 non-null  object
 26  punycode                    19431 non-null  object
 27  port                        19431 non-null  object
 28  tld_in_path                 19431 non-null  object
 29  tld_in_subdomain            19431 non-null  object
 30  abnormal_subdomain          19431 non-null  object
 31  nb_subdomains               19431 non-null  object
 32  prefix_suffix               19431 non-null  object
 33  random_domain               19431 non-null  object
 34  shortening_service          19431 non-null  object
 35  path_extension              19431 non-null  object
 36  nb_redirection              19431 non-null  object
 37  nb_external_redirection     19431 non-null  object
 38  length_words_raw            19431 non-null  object
 39  char_repeat                 19431 non-null  object
 40  shortest_words_raw          19431 non-null  object
 41  shortest_word_host          19431 non-null  object
 42  shortest_word_path          19431 non-null  object
 43  longest_words_raw           19431 non-null  object
 44  longest_word_host           19431 non-null  object
 45  longest_word_path           19431 non-null  object
 46  avg_words_raw               19431 non-null  object
 47  avg_word_host               19431 non-null  object
 48  avg_word_path               19431 non-null  object
 49  phish_hints                 19431 non-null  object
 50  domain_in_brand             19431 non-null  object
 51  brand_in_subdomain          19431 non-null  object
 52  brand_in_path               19431 non-null  object
 53  suspecious_tld              19431 non-null  object
 54  statistical_report          19431 non-null  object
 55  nb_hyperlinks               19431 non-null  object
 56  ratio_intHyperlinks         19431 non-null  object
 57  ratio_extHyperlinks         19431 non-null  object
 58  ratio_nullHyperlinks        19431 non-null  object
 59  nb_extCSS                   19431 non-null  object
 60  ratio_intRedirection        19431 non-null  object
 61  ratio_extRedirection        19431 non-null  object
 62  ratio_intErrors             19431 non-null  object
 63  ratio_extErrors             19431 non-null  object
 64  login_form                  19431 non-null  object
 65  external_favicon            19431 non-null  object
 66  links_in_tags               19431 non-null  object
 67  submit_email                19431 non-null  object
 68  ratio_intMedia              19431 non-null  object
 69  ratio_extMedia              19431 non-null  object
 70  sfh                         19431 non-null  object
 71  iframe                      19431 non-null  object
 72  popup_window                19431 non-null  object
 73  safe_anchor                 19431 non-null  object
 74  onmouseover                 19431 non-null  object
 75  right_clic                  19431 non-null  object
 76  empty_title                 19431 non-null  object
 77  domain_in_title             19431 non-null  object
 78  domain_with_copyright       19431 non-null  object
 79  whois_registered_domain     19431 non-null  object
 80  domain_registration_length  19431 non-null  object
 81  domain_age                  19431 non-null  object
 82  web_traffic                 19431 non-null  object
 83  dns_record                  19431 non-null  object
 84  google_index                19431 non-null  object
 85  page_rank                   19431 non-null  object
 86  status                      19431 non-null  object
dtypes: object(87)
memory usage: 12.9+ MB
In [ ]:
# Sampling the dataset
df.sample(10)
Out[ ]:
id url url_length hostname_length ip total_of. total_of- total_of@ total_of? total_of& ... domain_in_title domain_with_copyright whois_registered_domain domain_registration_length domain_age web_traffic dns_record google_index page_rank status
61 61 https://en.wikipedia.org/wiki/Switched_at_Birt... 58 16 0 2 0 0 0 0 ... 0 zero 0 902 7133 12 0 0 7 legitimate
17573 9572 http://outlook-webapp-portal.el.r.appspot.com/... 54 38 1 5 2 0 0 0 ... 1 1 0 228 5616 0 0 1 5 phishing
9929 1928 https://www.scilearn.com/ 25 16 1 2 0 0 0 0 ... 1 1 0 219 8914 74155 0 0 5 legitimate
2936 2936 http://www.whatsapps-invites.zzux.com/ 38 30 0 3 1 0 0 0 ... 1 one 0 116 7189 481145 1 1 1 phishing
1291 1291 http://support-appleld.com.secureupdate.duilaw... 76 50 1 4 1 0 0 0 ... 1 zero 0 14 4003 5816617 0 1 0 phishing
15484 7483 http://www.davidcourtemarche.com/image/ 39 25 1 2 0 0 0 0 ... 0 1 0 137 959 0 0 1 0 phishing
14868 6867 http://caspianglobalservices.com/awosoke/fud/f... 69 25 1 2 0 0 0 0 ... 1 0 0 217 1975 0 0 1 0 phishing
16893 8892 https://jabkzahrimasjoun.blogspot.com/ 38 29 1 2 0 0 0 0 ... 1 1 0 373 7296 0 0 1 5 phishing
4873 4873 http://calzados32.webcindario.com/app/facebook... 268 26 0 3 0 0 1 1 ... 1 zero 0 952 7083 17964 0 1 3 phishing
18156 10155 http://www.cassa7c.com/boa/boa/index.html 41 15 1 3 0 0 0 0 ... 1 0 0 237 128 0 1 1 0 phishing

10 rows × 87 columns

Preprocessing¶

🆔 Encoding¶

After understanding the data on the sample, I found that some data are not in a good form and there is a room for improvement, such as the domain_with_copyright and status columns.

In [ ]:
df['status'].unique()
Out[ ]:
array(['phishing', 'legitimate'], dtype=object)

As you can see on the status column, there are only 2 values, phishing and legitimate. Which mean I can transform it into binary values (0 and 1).

In [ ]:
df['status'] = df['status'].map({'phishing': 1, 'legitimate': 0})
df.head()
Out[ ]:
id url url_length hostname_length ip total_of. total_of- total_of@ total_of? total_of& ... domain_in_title domain_with_copyright whois_registered_domain domain_registration_length domain_age web_traffic dns_record google_index page_rank status
0 0 http://www.progarchives.com/album.asp?id=61737 46 20 0 3 0 0 1 0 ... 1 one 0 627 6678 78526 0 0 5 1
1 1 http://signin.eday.co.uk.ws.edayisapi.dllsign.... 128 120 0 10 0 0 0 0 ... 1 zero 0 300 65 0 0 1 0 1
2 2 http://www.avevaconstruction.com/blesstool/ima... 52 25 0 3 0 0 0 0 ... 1 zero 0 119 1707 0 0 1 0 1
3 3 http://www.jp519.com/ 21 13 0 2 0 0 0 0 ... 1 one 0 130 1331 0 0 0 0 0
4 4 https://www.velocidrone.com/ 28 19 0 2 0 0 0 0 ... 0 zero 0 164 1662 312044 0 0 4 0

5 rows × 87 columns

After a closer look, I spotted that there are some inconsistencies with the value on domain_with_copyright column, for example One and one. Similarly, I want to transform it into binary value 0 and 1, instead of the string

In [ ]:
df['domain_with_copyright'].unique()
Out[ ]:
array(['one', 'zero', 'One', 'Zero', '1', '0'], dtype=object)
In [ ]:
df['domain_with_copyright'] = df['domain_with_copyright'].map({'one': 1, 'zero': 0, 'Zero': 0, 'One': 1,'1': 1, '0': 0}).astype(int)
df['domain_with_copyright'].unique()
Out[ ]:
array([1, 0])

Checking null or NaN values¶

In [ ]:
# Calculate the total number of missing values in the DataFrame
total_na = df.isna().sum()
In [ ]:
# Calculate the total number of missing values in the DataFrame
total_null = df.isnull().sum()
total_null.sum()
Out[ ]:
0
In [ ]:
# Finding columns with binary values

def count_binary_columns(df):
    results = []
    counter = 0
    for col in df.columns:
        counter += 1
        if df[col].isin([0, 1]).all():
            results.append(col)
    return results, counter


count_binary_columns(df)
Out[ ]:
(['domain_with_copyright', 'status'], 87)
In [ ]:
df = df.drop(columns=['id', 'url'])
df.head()
Out[ ]:
url_length hostname_length ip total_of. total_of- total_of@ total_of? total_of& total_of= total_of_ ... domain_in_title domain_with_copyright whois_registered_domain domain_registration_length domain_age web_traffic dns_record google_index page_rank status
0 46 20 0 3 0 0 1 0 1 0 ... 1 1 0 627 6678 78526 0 0 5 1
1 128 120 0 10 0 0 0 0 0 0 ... 1 0 0 300 65 0 0 1 0 1
2 52 25 0 3 0 0 0 0 0 0 ... 1 0 0 119 1707 0 0 1 0 1
3 21 13 0 2 0 0 0 0 0 0 ... 1 1 0 130 1331 0 0 0 0 0
4 28 19 0 2 0 0 0 0 0 0 ... 0 0 0 164 1662 312044 0 0 4 0

5 rows × 85 columns

In [ ]:
df['whois_registered_domain'].unique()
Out[ ]:
array(['0', '1'], dtype=object)
In [ ]:
print(df['status'].value_counts())
df['status'].value_counts().plot(kind='bar', title='Count the target variable')    
status
0    9716
1    9715
Name: count, dtype: int64
Out[ ]:
<Axes: title={'center': 'Count the target variable'}, xlabel='status'>

💡 Feature selection¶

A heatmap will be used to select a suitable set of features to predict the status target upon. At this stage, I have zero idea which feature to use and I utilized heatmap to find features with the most corellation with the target feature.

Data Visualization¶

First, to determine which feature to be used on the model, I want to visualize the correlation of the features.

Creating a heatmap to visualize the correlation between the features¶

In [ ]:
import seaborn as sns
import matplotlib.pyplot as plt

corr = df.corr()
plt.figure(figsize=(100, 100))
plot = sns.heatmap(corr, annot=True, fmt='.2f', linewidths=2)

Sorting the feature correlation values¶

In [ ]:
# Sorting the correlation values with the target variable in descending order
corr.drop('status').sort_values(by='status', ascending=False).plot.bar(y='status', title='Correlation with the target variable', figsize=(20, 10))
Out[ ]:
<Axes: title={'center': 'Correlation with the target variable'}>

#

In [ ]:
# Finding the most correlated features with the target variable based on numerical featrures excluding NaN values
correlation_matrix = df.corr(numeric_only=True)
sorted_corr = correlation_matrix.sort_values(by='status',ascending=False)
sorted_corr
Out[ ]:
url_length hostname_length total_of. total_of- total_of? total_of/ total_of_www ratio_digits_url phish_hints nb_hyperlinks domain_in_title domain_with_copyright google_index page_rank status
status 0.244348 0.240681 0.205302 -0.102849 0.293920 0.240892 -0.444561 0.356587 0.337287 -0.341295 0.339519 -0.175469 0.730684 -0.509761 1.000000
google_index 0.233061 0.216919 0.208764 -0.018285 0.202097 0.289212 -0.357215 0.323157 0.279906 -0.269482 0.265933 -0.144499 1.000000 -0.386721 0.730684
ratio_digits_url 0.434626 0.171761 0.224194 0.110341 0.325739 0.206925 -0.211165 1.000000 0.096967 -0.128915 0.152393 -0.027357 0.323157 -0.181489 0.356587
domain_in_title 0.124224 0.218850 0.108442 0.009843 0.092191 0.088462 -0.178402 0.152393 0.125857 -0.217548 1.000000 0.076105 0.265933 -0.332742 0.339519
phish_hints 0.332000 -0.019901 0.168765 0.065562 0.208052 0.501321 -0.090812 0.096967 1.000000 -0.112423 0.125857 -0.066130 0.279906 -0.203464 0.337287
total_of? 0.523172 0.164129 0.353133 0.035958 1.000000 0.243749 -0.115337 0.325739 0.208052 -0.112604 0.092191 -0.046123 0.202097 -0.123151 0.293920
url_length 1.000000 0.217586 0.447198 0.406951 0.523172 0.486490 -0.067973 0.434626 0.332000 -0.098101 0.124224 -0.004281 0.233061 -0.099900 0.244348
total_of/ 0.486490 -0.061203 0.242216 0.204793 0.243749 1.000000 -0.005628 0.206925 0.501321 -0.073183 0.088462 -0.023213 0.289212 -0.113861 0.240892
hostname_length 0.217586 1.000000 0.406834 0.059480 0.164129 -0.061203 -0.130991 0.171761 -0.019901 -0.104614 0.218850 0.073107 0.216919 -0.160621 0.240681
total_of. 0.447198 0.406834 1.000000 0.049303 0.353133 0.242216 0.068290 0.224194 0.168765 -0.093994 0.108442 0.057320 0.208764 -0.098752 0.205302
total_of- 0.406951 0.059480 0.049303 1.000000 0.035958 0.204793 0.045756 0.110341 0.065562 -0.004513 0.009843 0.020914 -0.018285 0.104676 -0.102849
domain_with_copyright -0.004281 0.073107 0.057320 0.020914 -0.046123 -0.023213 0.087826 -0.027357 -0.066130 0.192159 0.076105 1.000000 -0.144499 0.057127 -0.175469
nb_hyperlinks -0.098101 -0.104614 -0.093994 -0.004513 -0.112604 -0.073183 0.114259 -0.128915 -0.112423 1.000000 -0.217548 0.192159 -0.269482 0.221066 -0.341295
total_of_www -0.067973 -0.130991 0.068290 0.045756 -0.115337 -0.005628 1.000000 -0.211165 -0.090812 0.114259 -0.178402 0.087826 -0.357215 0.110745 -0.444561
page_rank -0.099900 -0.160621 -0.098752 0.104676 -0.123151 -0.113861 0.110745 -0.181489 -0.203464 0.221066 -0.332742 0.057127 -0.386721 1.000000 -0.509761
In [ ]:
# Get all the correlated features with the target variable
num_features = len(sorted_corr['status']) # 15 features
sorted_corr['status'].head(num_features)
Out[ ]:
status                   1.000000
google_index             0.730684
ratio_digits_url         0.356587
domain_in_title          0.339519
phish_hints              0.337287
total_of?                0.293920
url_length               0.244348
total_of/                0.240892
hostname_length          0.240681
total_of.                0.205302
total_of-               -0.102849
domain_with_copyright   -0.175469
nb_hyperlinks           -0.341295
total_of_www            -0.444561
page_rank               -0.509761
Name: status, dtype: float64
In [ ]:
# List the features from the previous step into a list
selected_features = ['google_index', 'ratio_digits_url', 'domain_in_title', 'phish_hints', 'total_of?', 'url_length', 'total_of/','hostname_length','total_of.', 'total_of-','domain_with_copyright','nb_hyperlinks','total_of_www','page_rank']
# selected_features = sorted_corr['status'].head(num_features).index.tolist()
df[selected_features] = df[selected_features].apply(pd.to_numeric, errors='coerce')

# Check the data types of the selected columns after conversion
print(df[selected_features].dtypes)

# Check if 'status' column exists and has categorical or numerical data
print(df['status'].dtype)

# Create a DataFrame with the selected columns
selected_df = df[selected_features + ['status']]
selected_df.head()
google_index               int64
ratio_digits_url         float64
domain_in_title            int64
phish_hints                int64
total_of?                  int64
url_length                 int64
total_of/                  int64
hostname_length            int64
total_of.                  int64
total_of-                  int64
domain_with_copyright      int32
nb_hyperlinks              int64
total_of_www               int64
page_rank                  int64
dtype: object
int64
Out[ ]:
google_index ratio_digits_url domain_in_title phish_hints total_of? url_length total_of/ hostname_length total_of. total_of- domain_with_copyright nb_hyperlinks total_of_www page_rank status
0 0 0.108696 1 0 1 46 3 20 3 0 1 143 1 5 1
1 1 0.054688 1 2 0 128 3 120 10 0 0 0 0 0 1
2 1 0.000000 1 0 0 52 4 25 3 0 0 3 1 0 1
3 0 0.142857 1 0 0 21 3 13 2 0 1 404 1 0 0
4 0 0.000000 0 0 0 28 3 19 2 0 0 57 1 4 0
In [ ]:
# Count the number of binary columns in the selected features

features_binary = count_binary_columns(df[selected_features])
features_binary
Out[ ]:
(['google_index', 'domain_in_title', 'domain_with_copyright'], 14)
In [ ]:
from sklearn.preprocessing import StandardScaler
# Scale the data
selected_df = selected_df.dropna()
scaler = StandardScaler()
selected_df[selected_features] = scaler.fit_transform(selected_df[selected_features])
In [ ]:
from pandas.plotting import scatter_matrix
scatter_matrix(selected_df, alpha=1, figsize=(60, 60), diagonal='hist')
plt.show()
In [ ]:
# Create pairplot
sns.pairplot(selected_df, hue='status', palette='Set1')

# Add legends
plt.legend(title='Status', labels=['Phishing', 'Legitimate'])

# Show the plot
plt.show()
In [ ]:
target = 'status'

X = df[selected_features]
y = df[target]

🪓 Splitting into train/test¶

In [ ]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)
print("There are in total", len(X), "observations, of which", len(X_train), "are now in the train set, and", len(X_test), "in the test set.")
There are in total 19431 observations, of which 15544 are now in the train set, and 3887 in the test set.

🧬 Modelling¶

Support Vector Machine¶

In [ ]:
# SUPPORT VECTOR MACHINE SVM
from sklearn.svm import SVC
model = SVC()
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print("Accuracy:", score)
Accuracy: 0.8505273990223823
In [ ]:
from sklearn.metrics import classification_report
predictions = model.predict(X_test)
report = classification_report(y_test, predictions)
print(report)
              precision    recall  f1-score   support

           0       0.87      0.82      0.85      1939
           1       0.83      0.88      0.85      1948

    accuracy                           0.85      3887
   macro avg       0.85      0.85      0.85      3887
weighted avg       0.85      0.85      0.85      3887

Linear Regression¶

In [ ]:
# LINEAR REGRESSION

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print("R²:", score)
R²: 0.7019279787877617
In [ ]:
import shap

# Shap explainer initialized with the model and training data
explainer = shap.Explainer(model, X_train)

# Calculate Shap values for the predictions made on the test set
shap_values = explainer.shap_values(X_test)

# Plot the Shap values using bee swarm plot
shap.summary_plot(shap_values, X_test)
IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html

🏘️ K-NEAREST NEIGBOURS¶

In [ ]:
# K-NEAREST NEIGHBORS

from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=4)
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print("Accuracy:", score)
Accuracy: 0.9194751736557757

🌲Decision Tree¶

In [ ]:
# DECISION TREE

from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(min_samples_leaf=40, min_samples_split=300)
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print("Accuracy:", score)
Accuracy: 0.9279650115770517
In [ ]:
target_names = ["phishing", "legitimate"]
import matplotlib.pyplot as plt
plt.figure(figsize=(40,40))
from sklearn.tree import plot_tree
plot_tree(model, fontsize=8, feature_names=selected_features, class_names=target_names)
plt.show()